Web Content Categorization Using Link Information

Authors

  • Zoltán Gyöngyi
  • Hector Garcia-Molina
  • Jan Pedersen
Abstract

Document categorization is one of the foundational problems in (web) information retrieval. Even though web documents are hyperlinked, most proposed classification techniques take little advantage of the link structure and rely primarily on text features, as it is not immediately clear how to make link information intelligible to supervised machine learning algorithms. This paper introduces a link-based approach to classification, which can be used in isolation or in conjunction with text-based classification. Various large-scale experimental results indicate that link-based classification is on par with text-based classification, and the combination of the two offers the best of both worlds.

Similar resources

Topic Continuity for Web Document Categorization and Ranking

PageRank is primarily based on link structure analysis. Recently, it has been shown that content information can be utilized to improve link analysis. We propose a novel algorithm that harnesses the information contained in the history of a surfer to determine his topic of interest when he is on a given page. As the history is unavailable until query time, we guess it probabilistically so that ...

Using neighborhood information for automated categorization of Web pages

In this paper we discuss several issues related to how expanding a Web document's representation affects the quality of topical categorization of Web pages. We consider expanding a Web page using the text content of its linking pages. We show that naive expansion can pull in too much noise and substantially harm categorization results. We present an approach to automated pruning of linking Web...

Web Page Categorization using Multilayer Perceptron with Reduced Features

The web is a huge repository of knowledge and numerous hyperlinks. The web also serves a broad diversity of user communities and global information service centers. Every day the knowledge in web pages grows rapidly. Web pages can be used to convey knowledge to web users. The voluminous size of the web complicates web information retrieval, web content filtering and web structure mi...

A Novel Approach for Text Categorization of Unorganized data based with Information Extraction

The Internet has profoundly changed the lives of many enthusiastic innovators and researchers. The information available on the web has knocked on the doors of Knowledge Discovery, leading to a new Information era. Unfortunately, most search engines return web content that is irrelevant to the information the user intended to find. Many Text Categorization techniques for web content have been de...

Categorization of web pages - Performance enhancement to search engine

With the advent of technology, people strive for relevant and optimal results from the web through search engines. Retrieval performance can often be improved using several algorithms and methods. The abundance of web content has created pressure for better search systems. Categorization of web pages helps considerably in addressing this issue. The anatomy of the web pages, links, categorization of text and ...

Publication date: 2006